{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cross-validation framework\n",
    "\n",
    "In the previous notebooks, we introduce some concepts regarding the evaluation\n",
    "of predictive models. While this section could be slightly redundant, we\n",
    "intend to go into details into the cross-validation framework.\n",
    "\n",
    "Before we dive in, let's focus on the reasons for always having training and\n",
    "testing sets. Let's first look at the limitation of using a dataset without\n",
    "keeping any samples out.\n",
    "\n",
    "To illustrate the different concepts, we will use the California housing\n",
    "dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "\n",
    "housing = fetch_california_housing(as_frame=True)\n",
    "data, target = housing.data, housing.target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this dataset, the aim is to predict the median value of houses in an area\n",
    "in California. The features collected are based on general real-estate and\n",
    "geographical information.\n",
    "\n",
    "Therefore, the task to solve is different from the one shown in the previous\n",
    "notebook. The target to be predicted is a continuous variable and not anymore\n",
    "discrete. This task is called regression.\n",
    "\n",
    "Thus, we will use a predictive model specific to regression and not to\n",
    "classification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(housing.DESCR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To simplify future visualization, let's transform the prices from the 100\n",
    "(k\\\\$) range to the thousand dollars (k\\\\$) range."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target *= 100\n",
    "target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training error vs testing error\n",
    "\n",
    "To solve this regression task, we will use a decision tree regressor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "regressor = DecisionTreeRegressor(random_state=0)\n",
    "regressor.fit(data, target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After training the regressor, we would like to know its potential\n",
    "generalization performance once deployed in production. For this purpose, we\n",
    "use the mean absolute error, which gives us an error in the native unit, i.e.\n",
    "k\\\\$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_absolute_error\n",
    "\n",
    "target_predicted = regressor.predict(data)\n",
    "score = mean_absolute_error(target, target_predicted)\n",
    "print(f\"On average, our regressor makes an error of {score:.2f} k$\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "We get perfect prediction with no error. It is too optimistic and almost\n",
    "always revealing a methodological problem when doing machine learning.\n",
    "\n",
    "Indeed, we trained and predicted on the same dataset. Since our decision tree\n",
    "was fully grown, every sample in the dataset is stored in a leaf node.\n",
    "Therefore, our decision tree fully memorized the dataset given during `fit`\n",
    "and therefore made no error when predicting.\n",
    "\n",
    "This error computed above is called the **empirical error** or **training\n",
    "error**.\n",
    "\n",
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">In this MOOC, we will consistently use the term \"training error\".</p>\n",
    "</div>\n",
    "\n",
    "We trained a predictive model to minimize the training error but our aim is to\n",
    "minimize the error on data that has not been seen during training.\n",
    "\n",
    "This error is also called the **generalization error** or the \"true\" **testing\n",
    "error**.\n",
    "\n",
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">In this MOOC, we will consistently use the term \"testing error\".</p>\n",
    "</div>\n",
    "\n",
    "Thus, the most basic evaluation involves:\n",
    "\n",
    "* splitting our dataset into two subsets: a training set and a testing set;\n",
    "* fitting the model on the training set;\n",
    "* estimating the training error on the training set;\n",
    "* estimating the testing error on the testing set.\n",
    "\n",
    "So let's split our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "data_train, data_test, target_train, target_test = train_test_split(\n",
    "    data, target, random_state=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, let's train our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "regressor.fit(data_train, target_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we estimate the different types of errors. Let's start by computing\n",
    "the training error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_predicted = regressor.predict(data_train)\n",
    "score = mean_absolute_error(target_train, target_predicted)\n",
    "print(f\"The training error of our model is {score:.2f} k$\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe the same phenomena as in the previous experiment: our model\n",
    "memorized the training set. However, we now compute the testing error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_predicted = regressor.predict(data_test)\n",
    "score = mean_absolute_error(target_test, target_predicted)\n",
    "print(f\"The testing error of our model is {score:.2f} k$\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This testing error is actually about what we would expect from our model if it\n",
    "was used in a production environment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stability of the cross-validation estimates\n",
    "\n",
    "When doing a single train-test split we don't give any indication regarding\n",
    "the robustness of the evaluation of our predictive model: in particular, if\n",
    "the test set is small, this estimate of the testing error will be unstable and\n",
    "wouldn't reflect the \"true error rate\" we would have observed with the same\n",
    "model on an unlimited amount of test data.\n",
    "\n",
    "For instance, we could have been lucky when we did our random split of our\n",
    "limited dataset and isolated some of the easiest cases to predict in the\n",
    "testing set just by chance: the estimation of the testing error would be\n",
    "overly optimistic, in this case.\n",
    "\n",
    "**Cross-validation** allows estimating the robustness of a predictive model by\n",
    "repeating the splitting procedure. It will give several training and testing\n",
    "errors and thus some **estimate of the variability of the model generalization\n",
    "performance**.\n",
    "\n",
    "There are [different cross-validation\n",
    "strategies](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators),\n",
    "for now we are going to focus on one called \"shuffle-split\". At each iteration\n",
    "of this strategy we:\n",
    "\n",
    "- randomly shuffle the order of the samples of a copy of the full dataset;\n",
    "- split the shuffled dataset into a train and a test set;\n",
    "- train a new model on the train set;\n",
    "- evaluate the testing error on the test set.\n",
    "\n",
    "We repeat this procedure `n_splits` times. Keep in mind that the computational\n",
    "cost increases with `n_splits`.\n",
    "\n",
    "![Cross-validation diagram](../figures/shufflesplit_diagram.png)\n",
    "\n",
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">This figure shows the particular case of <strong>shuffle-split</strong> cross-validation\n",
    "strategy using <tt class=\"docutils literal\">n_splits=5</tt>.\n",
    "For each cross-validation split, the procedure trains a model on all the red\n",
    "samples and evaluate the score of the model on the blue samples.</p>\n",
    "</div>\n",
    "\n",
    "In this case we will set `n_splits=40`, meaning that we will train 40 models\n",
    "in total and all of them will be discarded: we just record their\n",
    "generalization performance on each variant of the test set.\n",
    "\n",
    "To evaluate the generalization performance of our regressor, we can use\n",
    "[`sklearn.model_selection.cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)\n",
    "with a\n",
    "[`sklearn.model_selection.ShuffleSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html)\n",
    "object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_validate\n",
    "from sklearn.model_selection import ShuffleSplit\n",
    "\n",
    "cv = ShuffleSplit(n_splits=40, test_size=0.3, random_state=0)\n",
    "cv_results = cross_validate(\n",
    "    regressor, data, target, cv=cv, scoring=\"neg_mean_absolute_error\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results `cv_results` are stored into a Python dictionary. We will convert\n",
    "it into a pandas dataframe to ease visualization and manipulation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "cv_results = pd.DataFrame(cv_results)\n",
    "cv_results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition tip alert alert-warning\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
    "<p>A score is a metric for which higher values mean better results. On the\n",
    "contrary, an error is a metric for which lower values mean better results.\n",
    "The parameter <tt class=\"docutils literal\">scoring</tt> in <tt class=\"docutils literal\">cross_validate</tt> always expect a function that is\n",
    "a score.</p>\n",
    "<p class=\"last\">To make it easy, all error metrics in scikit-learn, like\n",
    "<tt class=\"docutils literal\">mean_absolute_error</tt>, can be transformed into a score to be used in\n",
    "<tt class=\"docutils literal\">cross_validate</tt>. To do so, you need to pass a string of the error metric\n",
    "with an additional <tt class=\"docutils literal\">neg_</tt> string at the front to the parameter <tt class=\"docutils literal\">scoring</tt>;\n",
    "for instance <tt class=\"docutils literal\"><span class=\"pre\">scoring=\"neg_mean_absolute_error\"</span></tt>. In this case, the negative\n",
    "of the mean absolute error will be computed which would be equivalent to a\n",
    "score.</p>\n",
    "</div>\n",
    "\n",
    "Let us revert the negation to get the actual error:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv_results[\"test_error\"] = -cv_results[\"test_score\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check the results reported by the cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv_results.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We get timing information to fit and predict at each cross-validation\n",
    "iteration. Also, we get the test score, which corresponds to the testing error\n",
    "on each of the splits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(cv_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We get 40 entries in our resulting dataframe because we performed 40 splits.\n",
    "Therefore, we can show the testing error distribution and thus, have an\n",
    "estimate of its variability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "cv_results[\"test_error\"].plot.hist(bins=10, edgecolor=\"black\")\n",
    "plt.xlabel(\"Mean absolute error (k$)\")\n",
    "_ = plt.title(\"Test error distribution\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that the testing error is clustered around 47 k\\\\$ and ranges from\n",
    "43 k\\\\$ to 50 k\\\\$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    \"The mean cross-validated testing error is: \"\n",
    "    f\"{cv_results['test_error'].mean():.2f} k$\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    \"The standard deviation of the testing error is: \"\n",
    "    f\"{cv_results['test_error'].std():.2f} k$\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the standard deviation is much smaller than the mean: we could\n",
    "summarize that our cross-validation estimate of the testing error is 46.36 \u00b1\n",
    "1.17 k\\\\$.\n",
    "\n",
    "If we were to train a single model on the full dataset (without\n",
    "cross-validation) and then later had access to an unlimited amount of test\n",
    "data, we would expect its true testing error to fall close to that region.\n",
    "\n",
    "While this information is interesting in itself, it should be contrasted to\n",
    "the scale of the natural variability of the vector `target` in our dataset.\n",
    "\n",
    "Let us plot the distribution of the target variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target.plot.hist(bins=20, edgecolor=\"black\")\n",
    "plt.xlabel(\"Median House Value (k$)\")\n",
    "_ = plt.title(\"Target distribution\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"The standard deviation of the target is: {target.std():.2f} k$\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The target variable ranges from close to 0 k\\\\$ up to 500 k\\\\$ and, with a\n",
    "standard deviation around 115 k\\\\$.\n",
    "\n",
    "We notice that the mean estimate of the testing error obtained by\n",
    "cross-validation is a bit smaller than the natural scale of variation of the\n",
    "target variable. Furthermore, the standard deviation of the cross validation\n",
    "estimate of the testing error is even smaller.\n",
    "\n",
    "This is a good start, but not necessarily enough to decide whether the\n",
    "generalization performance is good enough to make our prediction useful in\n",
    "practice.\n",
    "\n",
    "We recall that our model makes, on average, an error around 47 k\\\\$. With this\n",
    "information and looking at the target distribution, such an error might be\n",
    "acceptable when predicting houses with a 500 k\\\\$. However, it would be an\n",
    "issue with a house with a value of 50 k\\\\$. Thus, this indicates that our\n",
    "metric (Mean Absolute Error) is not ideal.\n",
    "\n",
    "We might instead choose a metric relative to the target value to predict: the\n",
    "mean absolute percentage error would have been a much better choice.\n",
    "\n",
    "But in all cases, an error of 47 k\\\\$ might be too large to automatically use\n",
    "our model to tag house values without expert supervision.\n",
    "\n",
    "## More detail regarding `cross_validate`\n",
    "\n",
    "During cross-validation, many models are trained and evaluated. Indeed, the\n",
    "number of elements in each array of the output of `cross_validate` is a result\n",
    "from one of these `fit`/`score` procedures. To make it explicit, it is\n",
    "possible to retrieve these fitted models for each of the splits/folds by\n",
    "passing the option `return_estimator=True` in `cross_validate`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv_results = cross_validate(regressor, data, target, return_estimator=True)\n",
    "cv_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv_results[\"estimator\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The five decision tree regressors corresponds to the five fitted decision\n",
    "trees on the different folds. Having access to these regressors is handy\n",
    "because it allows to inspect the internal fitted parameters of these\n",
    "regressors.\n",
    "\n",
    "In the case where you only are interested in the test score, scikit-learn\n",
    "provide a `cross_val_score` function. It is identical to calling the\n",
    "`cross_validate` function and to select the `test_score` only (as we\n",
    "extensively did in the previous notebooks)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "scores = cross_val_score(regressor, data, target)\n",
    "scores"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this notebook, we saw:\n",
    "\n",
    "* the necessity of splitting the data into a train and test set;\n",
    "* the meaning of the training and testing errors;\n",
    "* the overall cross-validation framework with the possibility to study\n",
    "  generalization performance variations."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}